Exploring the limits of safety analysis 
in complex technological systems 

D. Sornette*

Department of Management, Technology and Economics, 
ETH Zurich, Scheuchzerstrasse 7, CH-8092 Zurich, Switzerland 

T. Maillart†

Department of Humanities, Social and Political Sciences, 
ETH Zurich, Rämistrasse 101, CH-8092 Zurich, Switzerland 

W. Kröger‡

Risk Center, ETH Zurich, Scheuchzerstrasse 7, CH-8092 Zurich, Switzerland 

(Dated: July 25, 2012) 



From biotechnology to cyber-risks, most extreme technological risks cannot be 
reliably estimated from historical statistics. Engineers resort to probability safety 
analysis (PSA), which consists in developing models to simulate accidents, poten- 
tial scenarios, their severity and frequency. However, even the best safety analysis 
struggles to account for evolving risks resulting from inter-connected networks and 
cascade effects. Taking nuclear risks as an example, the predicted plant-specific dis- 
tribution of losses is found to be significantly underestimated when compared with 
available empirical records. A simple cascade model suggests that the classification 
of the different possible safety regimes is intrinsically unstable in the presence of 
cascades. Even the best probabilistic safety analysis requires additional continuous 
validation, making the best use of the experienced realized incidents, near misses 
and accidents. 



* Electronic address: dsornette@ethz.ch
† Electronic address: tmaillart@ethz.ch
‡ Electronic address: kroeger@mavt.ethz.ch






Most innovations are adopted on the premise that the upside gains largely make up for the 
downside short- and long-term risks, in particular through the adoption of safety measures 
aiming at preventing or mitigating potential losses. But innovations are often disruptive and, 
by essence, break new ground. This implies that history is a poor guide for risk assessment 
due to the novelty of the technology and the corresponding insufficient statistics. For highly 
technical enterprises for which full scale experiments are beyond reach (such as the Internet 
and grid technology and the associated cyber-risks, technological and population growth and 
climate change, financial innovation and globalization and the dangers of systemic banking 
crises), engineers resort to probability safety analysis (PSA), which consists in developing 
fault tree models to simulate accidents, their different triggers and induced scenarios, their 
severities as well as their estimated frequencies [1, 2]. When performed as an on-going process 
that is continuously informed by new examples and refined accordingly, plant-specific safety 
analysis has proved very useful for the implementation of ever better safety barriers, keeping 
the advantages of the technology while reducing its undesirable dangers. 

Here, we question the adequacy of this commonly used approach for the prevention of 
extreme events. Let us take the example of the nuclear power plant industry, which has 
been a leader in the development of state-of-the-art safety analysis, with outstanding efforts 
aimed at preventing incidents from becoming major accidents. Unfortunately, PSA has been 
unable to prevent large catastrophes such as Three Mile Island (USA, 1979), Chernobyl 
(former Soviet Union, 1986) and Fukushima-Daiichi (Japan, 2011). Figure 1 (A) shows 
a one-to-one comparison of the probability distributions of damage respectively generated 
from PSA simulations (Farmer curve of the 1950s and Rasmussen curve of the 1970s) [3] 
and from historical records of the largest incidents and accidents from 1952 to 2011 in 
different kinds of nuclear facilities [4]. The PSA curve predicts a rather thin tail while 
the empirical distribution of losses is fat-tailed, a power law $\Pr(\text{loss} > S) = C / S^{\mu}$ with 
exponent $\mu \approx 0.7$ (see SM). Concretely, this means that, for nuclear power events with 
damage costing more than one billion dollars, their frequencies are underestimated by two 
orders of magnitude. Moreover, rather than being associated with just a few extreme cases, 
the power law distribution of losses suggests that the problem has intrinsic structural roots. 
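
To make the fat-tail statement concrete, the following minimal sketch evaluates how fast the exceedance probability of a pure power law $C / S^{\mu}$ decays when the loss threshold is multiplied by a given factor. The function name `exceedance_ratio` is illustrative only; the exponent 0.7 is the one quoted above, and the pure power-law form is assumed to hold over the whole range considered.

```python
def exceedance_ratio(mu: float, factor: float) -> float:
    """Ratio Pr(loss > factor * S) / Pr(loss > S) for a pure power-law tail C / S**mu."""
    return factor ** (-mu)


mu = 0.7  # tail exponent quoted in the text for the empirical records
for factor in (10, 100, 1000):
    ratio = exceedance_ratio(mu, factor)
    print(f"threshold x{factor}: exceedance probability multiplied by {ratio:.4f}")
```

With an exponent of 0.7, a tenfold larger loss is only about five times less likely, so very large losses remain far from negligible.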

We propose to rationalize this discrepancy between predicted and realized losses by rec- 
ognizing that most important industrial accidents involve cascades with inter-dependent and 
mutually amplifying effects [5], which are modeled in probability safety analysis using fault 
and event tree techniques. Thus, in most situations, one can model an accident as a 
succession of unbroken and broken safety barriers with increasing damage. Let us consider 
the simplest coarse-grained model for such cascades and use it to translate the discrepancy 
shown in Figure 1 (A) into technically meaningful insights. Starting from an initiating 
event generating a damage $S_0$, we assume that an event may cascade into a next level with 
probability $\beta$ with associated additional damage $\Lambda_1 \cdot S_0$. When this occurs, the cascade may 
continue to the next level, again with probability $\beta$ and further additional damage $\Lambda_2 \Lambda_1 \cdot S_0$. 
The probability for the incident to stop after $n$ steps is $P(n) = \beta^n (1 - \beta)$. After $n$ steps, the 
total damage is the sum of the damage at each level: $S_n = S_0 \sum_{k=1}^{n} \Lambda_1 \cdots \Lambda_k$. Thus, $S_n$ is the 
recurrence solution of the Kesten map ($S_n = \Lambda_n S_{n-1} + S_0$) [6, 7]: as soon as amplification 
occurs (technically, some of the factors $\Lambda_k$ are larger than 1), the distribution of losses is a 
power law, whose exponent $\mu$ is a function of $\beta$ and of the distribution of the factors $\Lambda_k$. 
In the case where all factors are equal to $\Lambda$, this model predicts three possible regimes for 
the distribution of damage: thinner than exponential for $\Lambda < 1$, exponential for $\Lambda = 1$, and 
power law for $\Lambda > 1$ with exponent $\mu = |\ln \beta| / \ln \Lambda$ (see SM). 
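
The cascade mechanism described above can be explored numerically. The sketch below is a minimal Monte Carlo illustration, not the authors' code: it assumes equal damage factors, iterates the Kesten recurrence starting from the initiating damage, and checks that the fraction of cascades reaching at least $n$ levels is close to $\beta^n$, which is the geometric ingredient behind the three regimes. The function names, the seed and the sample size are arbitrary choices made for illustration.

```python
import random


def simulate_cascade(beta, lam, s0=1.0, rng=random):
    """Return (n, S_n): number of cascade steps and total damage.

    Each further level is reached with probability beta; the damage follows
    the recurrence S_n = lam * S_{n-1} + s0, starting from the damage s0.
    """
    n, total = 0, s0
    while rng.random() < beta:
        total = lam * total + s0
        n += 1
    return n, total


def loss_after(n, lam, s0=1.0):
    """Total damage of a cascade that stops after exactly n steps."""
    total = s0
    for _ in range(n):
        total = lam * total + s0
    return total


rng = random.Random(0)          # fixed seed so that the run is reproducible
beta, lam = 0.9, 1.05           # values of the order of those discussed in the text
losses = [simulate_cascade(beta, lam, rng=rng)[1] for _ in range(100_000)]

threshold = loss_after(20, lam)  # damage reached after 20 cascade steps
empirical = sum(s >= threshold for s in losses) / len(losses)
print(f"empirical Pr(S >= S_20) = {empirical:.4f}, beta**20 = {beta ** 20:.4f}")
```

Because the damage grows roughly like $\Lambda^n$ while the probability of reaching level $n$ decays like $\beta^n$, eliminating $n$ between the two gives the power-law exponent quoted above for $\Lambda > 1$.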

Figure 1 (B) presents these different regimes and the corresponding parameters calibrated 
to the PSA curves and to the empirical records. We obtain ($\beta_{\rm Farmer} \approx 0.9$; $\Lambda_{\rm Farmer} \approx 1.05$) 
compared with ($\beta_{\rm emp.} \approx 0.95$; $\Lambda_{\rm emp.} \approx 1.10$). Interpreted within this cascade model, the 
safety analysis leading to the PSA curve attributes roughly a 90% probability that an incident 
having reached a given level of severity may cascade into the next one. To account for 
the observed distribution of losses, this number needs to be increased by just five percentage 
points, to about a 95% probability of cascading. Thus, an underestimation of only about 5% in 
both the probability $\beta$ for a cascade to continue and the additional amplifying factor $\Lambda$ leads 
to a significant underestimation of the distribution of losses, in particular for large events. 
This sensitivity stems from the proximity of $\beta$ to the critical value 1, likely 
due to optimization as occurs in almost any human enterprise associated with the design, 
modeling, planning and operating of many complex real-world problems [8-10]. 
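
The sensitivity discussed above can be checked with a few lines of arithmetic. The sketch below evaluates the exponent $|\ln \beta| / \ln \Lambda$ for the two calibrated parameter pairs and classifies the result using the standard moment conditions of a power-law tail (mean finite only if the exponent exceeds 1, variance finite only if it exceeds 2), which match the three regions named in the caption of Figure 1 (B). The function names and the regime labels as strings are illustrative assumptions.

```python
import math


def tail_exponent(beta: float, lam: float) -> float:
    """Power-law exponent mu = |ln beta| / ln Lambda, valid for Lambda > 1."""
    return abs(math.log(beta)) / math.log(lam)


def regime(mu: float) -> str:
    """Classify a power-law tail by which moments of the loss are finite."""
    if mu > 2.0:
        return "bounded"        # mean and variance both defined
    if mu > 1.0:
        return "semi-extreme"   # only the mean is defined
    return "extreme"            # mean and variance both unbounded


for label, beta, lam in [("safety analysis", 0.90, 1.05), ("empirical records", 0.95, 1.10)]:
    mu = tail_exponent(beta, lam)
    print(f"{label}: beta={beta}, Lambda={lam}, mu={mu:.2f}, regime={regime(mu)}")
```

A shift of about 0.05 in each parameter is enough to move the exponent from slightly above 2 to well below 1, that is, from one end of the phase diagram to the other.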

We conclude that probability safety analysis is currently not adapted to simulate large 
catastrophes. The recognition of (i) extreme heavy-tailed losses and (ii) the critical nature 
of cascades during incidents and accidents calls for the integration of the observations and 
statistics over a large heterogeneous set of incidents into the PSA framework, a procedure 
that is at present not implemented and seems still far from being considered. Moreover, 






safety analysis should be a never-ending process constantly building on past experience, 
including the evolving full distribution of losses, for the development and the implementation 
of ever improved measures based on model update [11]. 



[1] J. C. Lee and N. J. McCormick, Risk and Safety Analysis of Nuclear Systems, 1st ed., Wiley (2011). 
[2] W. Kröger and E. Zio, Vulnerable Systems, Springer (2011). 
[3] B. R. Sehgal, Light water reactor (LWR) safety, Nuclear Eng. Tech. 38 (8), 697-732 (2006). 
[4] B. Sovacool, The costs of failure: A preliminary assessment of major energy accidents, 1907-2007, Energy Policy 36, 1802-1820 (2008). 
[5] K. Peters, L. Buzna and D. Helbing, Modelling of cascading effects and efficient response to disaster spreading in complex networks, Int. J. Critical Infrastructures 4 (1/2), 46-62 (2008). 
[6] H. Kesten, Acta Math. 131, 207-248 (1973). 
[7] D. Sornette and R. Cont, J. Phys. I France 7, 431-444 (1997). 
[8] D. Sornette, J. Phys. I France 2, 2065-2073 (1992). 
[9] J. M. Carlson and J. Doyle, Proc. Natl. Acad. Sci. USA 99 (Suppl. 1), 2538-2545 (2002). 
[10] D. Sornette, Critical Phenomena in Natural Sciences, 2nd ed., Springer Series in Synergetics, Heidelberg (2004). 
[11] D. Sornette, A. B. Davis, K. Ide, K. R. Vixie, V. Pisarenko, and J. R. Kamm, Proc. Natl. Acad. Sci. USA 104 (16), 6562-6567 (2007). 






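[Figure 1, image not reproduced. Panel (A): distribution of losses, horizontal axis S (2006 Dollars) on a logarithmic scale, curves labelled "Empirical Records" and "Safety Analysis". Panel (B): phase diagram, horizontal axis damage factor Λ (0.95 to 1.30), vertical axis probability β (0.86 to 1.00), regions labelled "Bounded Risks", "Semi-Extreme Risks" and "Extreme Risks", a line labelled "Exponential distribution", and points labelled "Empirical Records" and "Safety Analysis".]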

FIG. 1: (A) Distribution of losses S normalized for 1 nuclear plant × year, obtained (i) from nuclear safety 
analysis [3] and (ii) from empirical records [4]. Safety analysis largely underestimates the losses due to 
nuclear incidents. The difference is striking in the tail of the distribution: the distribution obtained from 
the PSA method vanishes fast beyond 1 billion dollars of damage while empirical records exhibit a power law tail with 
exponent $\mu = 0.7 \pm 0.1$ with no apparent cut-off. (B) Phase diagram showing the three regions of fat-tail 
risks predicted by the cascade model: (i) bounded risks with mean loss and its variance defined, (ii) semi-extreme 
risks with only mean loss defined and (iii) extreme risks with unbounded mean loss and variance. 
Empirical records clearly identify nuclear accidents as extreme risks, whereas safety analysis predicts that 
damage following nuclear incidents is fat-tailed yet bounded. 






SUPPLEMENTARY MATERIALS 



A. Robustness of Empirical Distribution 

To gain confidence in the reported empirical power law distribution of losses, we assess its 
robustness over time. Indeed, an obvious concern with respect to the analysis presented in 
Figure S1 is that we are combining events involving different types of nuclear facilities and 
plants, in particular with different technologies and generations as well as varying 
operational contexts. But nuclear plants and their safety procedures are continuously updated, 
resulting in major technological improvements over time (Clery, 2011). This suggests that 
the distribution of losses should change to reflect that extreme risks are becoming less and 
less probable. Indeed, safety has improved following the Three Mile Island (1979) and Chernobyl 
(1986) accidents. The release of the WASH-1400 Report in the United 
States (1975) established Probability Safety Analysis (PSA) in place of 
more simplistic safety methods, and was followed by widespread adoption in other countries 
(Rasmussen et al., 1975). Figure S1 tests these points empirically. Panel (A) shows the 
cumulative number of significant events over time from Ref. [4]. Three regimes can clearly be 
distinguished. In particular, following the Chernobyl accident (April 1986), the 
rate of incidents has been reduced by roughly 70%, most likely as a result of additional safety 
measures worldwide (Clery, 2011). Figure S1 (B) shows the three empirical distributions of 
damage in each of these three periods (first period ending at the Three Mile Island incident, 
second period until Chernobyl and third period ending in 2011). It is remarkable that no 
significant difference can be found, notwithstanding the very different incident rates shown 
in Figure S1 (A). We conclude that safety improvements of nuclear plants had very positive 
effects in preventing initiating events, therefore reducing their occurrence rate. However, no 
significant change in the structure of the tail distribution (i.e. the relative likelihood of 
extreme risks compared with small risks) can be observed. This suggests that improved safety 
procedures and new technology have been quite successful in preventing incidents (as well 
as accidents). However, mitigation has not significantly improved. This is a paradox, since 
safety measures should also be designed to minimize the likelihood that, following initiation, 
incidents worsen and breach one or several of the seven defense-in-depth safety barriers that 
protect the infrastructure of a typical nuclear plant. Safety barriers include: (i) prevention 






of deviation from normal operation, (ii) control of abnormal operation, (iii) control of acci- 
dents in design basis, (iv) internal accident management including confinement protection 
and (v) off-site emergency response (International Atomic Energy Agency, 1996). 







FIG. S1: (A) Cumulative number of civil nuclear incidents over time since 1957. Three regimes can be 
distinguished: (i) the burgeoning nuclear industry with small but quickly increasing installed capacity with 
an average rate of 0.55 incidents per year; (ii) from the time of Three Mile Island to Chernobyl, a rate of 3.6 
incidents per year; (iii) the post-Chernobyl era is characterized by a rate of slightly above 2 incidents per 
year. (B) Test for stability of the empirical CCDF over three periods of nuclear power industry history, using 
adaptive kernel density estimators (Cranmer, 2001). (C) Calibration of the accident propagation model with 
expression (5). For the safety analysis (Farmer curve), we find that the probability of propagation is $\beta = 0.9$ 
(lower bound $\beta_{05} = 0.90$, upper bound $\beta_{95} = 0.91$ at 95% confidence interval), and the damage factor is 
$\Lambda = 1.05$ (lower bound $\Lambda_{05} = 1.04$, upper bound $\Lambda_{95} = 1.06$, at 95% confidence interval) for $S_{\min} = 10^{[\ldots]}$. 
(D) The best fit of expression (5) to the real data is found for $\beta = 0.95$ (lower bound $\beta_{05} = 0.94$, 
upper bound $\beta_{95} = 0.96$) and $\Lambda = 1.10$ (lower bound $\Lambda_{05} = 1.08$, upper bound $\Lambda_{95} = 1.15$). 






B. Derivations of the Model for homogeneous factors $\Lambda_k = \Lambda$ 

Here, we provide a detailed study of the possible behaviors of the model, also around the 
critical point A = 1. 

Three regimes must be considered: 

1. For $\Lambda < 1$, the distribution is given by 

$$P_{\Lambda<1}(S > s) = (1-\beta)\left(1 - \frac{s}{S_{\max}}\right)^{c} , \qquad c := \frac{\ln \beta}{\ln \Lambda} > 0 . \qquad (1)$$

This distribution can be approximated in its central part, away from the maximum 
possible loss $S_{\max}$, by a Weibull distribution of the form 

$$\Pr(S > s) \sim e^{-(s/d)^{c}} . \qquad (2)$$

For $\Lambda \to 1^{-}$, we have $S_{\max} \to +\infty$ and, for $s \ll S_{\max}$, expression (1) simplifies into a 
simple exponential function 

$$P_{\Lambda \to 1^{-}}(S > s) \sim e^{-|\ln(\beta)|\, s/S_0} . \qquad (3)$$

2. For $\Lambda = 1$, the distribution of losses is a simple exponential function since $S_n = n S_0$ 
is linear in the number $n$ of stages and the probability of reaching stage $n$ is the 
exponential $P(n) = \beta^n (1 - \beta)$. Actually, expression (1) becomes asymptotically 

$$P_{\Lambda=1}(S > s) = (1-\beta)\, e^{-|\ln(\beta)|\, s/S_0} . \qquad (4)$$

3. For $\Lambda > 1$, the distribution of losses is of the form 

$$P_{\Lambda>1}(S > s) = \frac{1-\beta}{\left(1 + s/s^{*}\right)^{c}} , \qquad s^{*} := \frac{S_0}{\Lambda - 1} , \qquad c := \frac{|\ln \beta|}{\ln \Lambda} , \qquad (5)$$

which develops to a power law distribution of losses of the form $\Pr(\text{loss} > S) = C / S^{\mu}$ 
with $\mu = c$, when $s \to +\infty$. 
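
The three regimes listed above can be gathered into a single function, which makes it easy to check numerically that they connect continuously at the critical point. This is a minimal sketch under stated assumptions: the maximum loss is taken as $S_0 / (1 - \Lambda)$ for $\Lambda < 1$ and the crossover scale as $S_0 / (\Lambda - 1)$ for $\Lambda > 1$, following from summing the geometric series of damages; the function name `ccdf` and its signature are illustrative.

```python
import math


def ccdf(s, beta, lam, s0=1.0):
    """Pr(S > s) for homogeneous damage factors, in the three regimes of the model."""
    if lam < 1.0:
        s_max = s0 / (1.0 - lam)              # assumed maximum possible loss
        if s >= s_max:
            return 0.0                        # losses cannot exceed s_max
        c = math.log(beta) / math.log(lam)    # positive: both logarithms are negative
        return (1.0 - beta) * (1.0 - s / s_max) ** c
    if lam == 1.0:
        return (1.0 - beta) * math.exp(-abs(math.log(beta)) * s / s0)
    s_star = s0 / (lam - 1.0)                 # assumed crossover scale to the power law
    c = abs(math.log(beta)) / math.log(lam)
    return (1.0 - beta) / (1.0 + s / s_star) ** c


# The three expressions agree when Lambda approaches 1 from either side.
s, beta = 10.0, 0.9
for lam in (1.0 - 1e-6, 1.0, 1.0 + 1e-6):
    print(f"Lambda = {lam:.6f}: Pr(S > {s}) = {ccdf(s, beta, lam):.6f}")
```

Evaluating the function far in the tail for a damage factor above 1 shows the expected scaling: multiplying the loss by ten divides the exceedance probability by ten to the power $c$.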






C. Model calibration 

We calibrate the model by using expression (5) because, for both the Farmer curve and 
empirical records, the solution lies in the region $\Lambda > 1$. The calibration consists in finding (i) 
the probability $\beta$ and (ii) the damage factor $\Lambda$ for the distributions obtained from the ex-ante 
predictions of the Farmer safety analysis and from the ex-post historical records. 

Both distributions are calibrated first by grid search. From the fitted parameters, one 
hundred distributions are generated and fitted again by grid search. The final parameters 
of each distribution are the median values of the bootstrapped distributions. Confidence 
intervals at 95% are given by the corresponding quantiles of the bootstrapped parameters. 
Figure S1 shows the best fits for the Farmer curve (panel C) and for empirical records (panel 
D). For the safety analysis (Farmer curve, panel C), we find the probability of propagation 
equal to $\beta = 0.9$ (lower bound $\beta_{05} = 0.90$, upper bound $\beta_{95} = 0.91$), and damage factor 
equal to $\Lambda = 1.05$ (lower bound $\Lambda_{05} = 1.04$, upper bound $\Lambda_{95} = 1.06$). Panel D for the real 
data shows that the best fit is found for $\beta = 0.95$ (lower bound $\beta_{05} = 0.94$, upper bound 
$\beta_{95} = 0.96$) and $\Lambda = 1.10$ (lower bound $\Lambda_{05} = 1.08$, upper bound $\Lambda_{95} = 1.15$). 
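
The calibration procedure described above (grid search followed by a parametric bootstrap) can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: the objective is taken to be the squared difference between empirical and model log-CCDF, the model CCDF is normalised to 1 at zero loss (the constant prefactor is dropped), synthetic data stand in for the loss records, and the grids, seeds and function names are arbitrary.

```python
import math
import random


def scale_and_exponent(beta, lam, s0=1.0):
    """Crossover scale s* and exponent c of the power-law regime (Lambda > 1)."""
    return s0 / (lam - 1.0), abs(math.log(beta)) / math.log(lam)


def sample(n, beta, lam, rng):
    """Inverse-transform sampling from the CCDF (1 + s/s*)**(-c)."""
    s_star, c = scale_and_exponent(beta, lam)
    return [s_star * ((1.0 - rng.random()) ** (-1.0 / c) - 1.0) for _ in range(n)]


def fit(data, betas, lams):
    """Grid search over (beta, Lambda) minimising the squared log-CCDF error."""
    xs = sorted(data)
    n = len(xs)
    # Empirical log-CCDF at the i-th smallest point; mid-ranks avoid log(0).
    emp = [math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    best = None
    for beta in betas:
        for lam in lams:
            s_star, c = scale_and_exponent(beta, lam)
            err = sum((-c * math.log1p(x / s_star) - e) ** 2 for x, e in zip(xs, emp))
            if best is None or err < best[0]:
                best = (err, beta, lam)
    return best[1], best[2]


def summary(values):
    """Median and 95% interval (2.5% and 97.5% quantiles) of a list of numbers."""
    v = sorted(values)

    def quantile(p):
        return v[min(len(v) - 1, int(p * len(v)))]

    return quantile(0.5), quantile(0.025), quantile(0.975)


def bootstrap(data, betas, lams, n_boot=100, seed=0):
    """Refit n_boot synthetic samples generated from the first grid-search fit."""
    rng = random.Random(seed)
    beta0, lam0 = fit(data, betas, lams)
    fits = [fit(sample(len(data), beta0, lam0, rng), betas, lams) for _ in range(n_boot)]
    return summary([f[0] for f in fits]), summary([f[1] for f in fits])


betas = [round(0.90 + 0.01 * i, 2) for i in range(9)]    # 0.90 ... 0.98
lams = [round(1.02 + 0.02 * i, 2) for i in range(10)]    # 1.02 ... 1.20
data = sample(200, 0.95, 1.10, random.Random(1))          # synthetic stand-in for loss records
(beta_med, beta_lo, beta_hi), (lam_med, lam_lo, lam_hi) = bootstrap(data, betas, lams)
print(f"beta   = {beta_med} [{beta_lo}, {beta_hi}]")
print(f"Lambda = {lam_med} [{lam_lo}, {lam_hi}]")
```

The reported values are the medians of the one hundred refits, and the interval bounds are their 2.5% and 97.5% quantiles; the resolution of the result is limited by the spacing of the grid.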

References 

Clery, D. (2011), Current designs address safety problems in Fukushima reactors, Science 
331, 1506. 

Cranmer, K. (2001), Computer Physics Communications 136, 198-207. 

International Atomic Energy Agency (1996), Defence in Depth in Nuclear Safety. 

Rasmussen, N. C. et al. (1975), Reactor safety study. An assessment of accident risks in U.S. 
commercial nuclear power plants, WASH-1400 (NUREG-75/014), U.S. Nuclear Regulatory 
Commission. 



